Feat: elo credit assignment #2
This looks so cool, I'm excited to test it out!

Exploration of "Tournament-Style Scoring (Ranked-Choice + Elo)", which comes from chess tournaments
Goal of this PR:
Establish whether this approach is useful for ranking projects and for identifying which are the most effective.
Conclusion:
In this implementation, Elo scoring ranks grant applications by simulating pairwise comparisons between projects based on their application data, AI-generated reviews, research summaries, karmaGap data, and Hypercerts data.
For each matchup, an LLM is prompted to decide which project deserves more funding, and Elo ratings are updated accordingly. After all comparisons, the Elo scores are normalized so they sum to 1, allowing you to proportionally allocate a fixed prize pool (e.g., $25k) based on each project’s relative standing. This approach effectively establishes a funding ranking, but it reflects relative preference—not absolute impact—so high-impact projects may not receive proportionally more funding unless further adjustments are made.
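The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it uses the textbook Elo update, and `ELO_K`, `ELO_START`, `updateElo`, `normalize`, and the project ids are hypothetical names and values chosen for the example. The LLM judge is replaced by hard-coded winners.

```typescript
// Hypothetical constants: the PR does not state its K-factor or starting rating.
const ELO_K = 32;
const ELO_START = 1500;

type Ratings = Record<string, number>;

// Textbook Elo expected score for A when playing B.
function expectedScore(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Apply one matchup. `winner` stands in for the LLM's decision.
function updateElo(ratings: Ratings, a: string, b: string, winner: string): void {
  const expectedA = expectedScore(ratings[a], ratings[b]);
  const scoreA = winner === a ? 1 : 0;
  ratings[a] += ELO_K * (scoreA - expectedA);
  ratings[b] += ELO_K * ((1 - scoreA) - (1 - expectedA));
}

// Scale ratings so they sum to 1 (each value is a share of the pool).
function normalize(ratings: Ratings): Ratings {
  const ids = Object.keys(ratings);
  const total = ids.reduce((sum, id) => sum + ratings[id], 0);
  const shares: Ratings = {};
  for (const id of ids) {
    shares[id] = ratings[id] / total;
  }
  return shares;
}

const ratings: Ratings = {
  projectA: ELO_START,
  projectB: ELO_START,
  projectC: ELO_START,
};

// Hard-coded outcomes in place of LLM judgements.
updateElo(ratings, "projectA", "projectB", "projectA");
updateElo(ratings, "projectA", "projectC", "projectA");
updateElo(ratings, "projectB", "projectC", "projectC");

const shares = normalize(ratings);
const PRIZE_POOL = 25000; // the $25k example from the description
for (const id of Object.keys(shares)) {
  console.log(id, (shares[id] * PRIZE_POOL).toFixed(2));
}
```

With these assumed values the ratings stay close to the starting rating, so the normalized shares all land near one third. That is one way to see why the result reads as a ranking rather than a proportional measure of impact.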
To test this, run:
It's good for ranking projects, but not suitable in its current form to answer the question:
This PR also includes the MAPRank approach.
Magnitude-Adaptive Pairwise Ranking (MAPRank) is designed to use a combination of magnitude and an adaptive K-factor for pairwise rankings.
See credit-assignment-map-rank.ts
Summary of the Magnitude-Adaptive Pairwise Ranking (MAPRank) approach:
This system evaluates applications through a series of pairwise comparisons within distinct "agent" contexts (representing different review models or perspectives).
For each matchup:
An AI model (the creditAssignmentAgent) determines not just which of the two applications is better, but also how much better it is on a scale from 0.5 (projects are roughly equal) to 1.0 (winner is significantly better).
This "how much better" score is translated into an actual_score for each project in the pair (e.g., if A wins with 0.8, A gets 0.8 and B gets 0.2).
Applications start with a BASE_RATING. Ratings are then updated directly based on the outcome and magnitude of this comparison. The update formula is:
new_rating = old_rating + effective_K * (actual_score_for_project - 0.5)
The effective_K factor is dynamic: it's calculated as BASE_K_FACTOR / sqrt(number_of_opponents_in_tournament). This makes the system more sensitive to individual match outcomes in smaller tournaments and more stable in larger ones, preventing ratings from changing too drastically or bottoming out.
After all pairwise comparisons for an agent are complete, the resulting raw scores are normalized to sum to 1, representing a proportional share (e.g., for funding allocation).
This method differs from traditional Elo because it doesn't rely on an "expected score" based on rating differences. Instead, it directly incorporates the magnitude of the perceived difference from the AI judge into the rating updates, providing a more nuanced assessment of relative quality.
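The MAPRank update described above can be sketched as below. This is an illustration under stated assumptions, not the contents of credit-assignment-map-rank.ts: the values of `BASE_RATING` and `BASE_K_FACTOR` are made up, `Judgement`, `effectiveK`, `applyMatch`, and `normalizeScores` are hypothetical names, the number of opponents is assumed to be the number of projects minus one, and the creditAssignmentAgent is replaced by hard-coded judgements.

```typescript
// Assumed values: the PR names these constants but does not give their values here.
const BASE_RATING = 1000;
const BASE_K_FACTOR = 100;

type Scores = Record<string, number>;

// Stand-in for the judge's output: magnitude runs from 0.5 (roughly equal)
// to 1.0 (winner is significantly better).
interface Judgement {
  winner: string;
  magnitude: number;
}

// effective_K = BASE_K_FACTOR / sqrt(number_of_opponents_in_tournament)
function effectiveK(numOpponents: number): number {
  return BASE_K_FACTOR / Math.sqrt(numOpponents);
}

// new_rating = old_rating + effective_K * (actual_score_for_project - 0.5)
function applyMatch(scores: Scores, a: string, b: string, j: Judgement, k: number): void {
  const actualA = j.winner === a ? j.magnitude : 1 - j.magnitude;
  const actualB = 1 - actualA;
  scores[a] += k * (actualA - 0.5);
  scores[b] += k * (actualB - 0.5);
}

// Scale raw scores so they sum to 1.
function normalizeScores(scores: Scores): Scores {
  const ids = Object.keys(scores);
  const total = ids.reduce((sum, id) => sum + scores[id], 0);
  const out: Scores = {};
  for (const id of ids) {
    out[id] = scores[id] / total;
  }
  return out;
}

const projects = ["projectA", "projectB", "projectC"];
const raw: Scores = {};
for (const id of projects) {
  raw[id] = BASE_RATING;
}

// Assumption: each project faces every other project once.
const k = effectiveK(projects.length - 1);

// Hard-coded judgements in place of the AI judge.
applyMatch(raw, "projectA", "projectB", { winner: "projectA", magnitude: 0.8 }, k);
applyMatch(raw, "projectA", "projectC", { winner: "projectA", magnitude: 0.6 }, k);
applyMatch(raw, "projectB", "projectC", { winner: "projectC", magnitude: 0.9 }, k);

const proportional = normalizeScores(raw);
console.log(proportional);
```

Because no expected score is subtracted, a win with magnitude 0.8 always moves a rating by the same amount for a given effective_K, regardless of the current rating gap between the two projects.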
To run this, run: